Abstract

This paper proposes a reconfigurable model to recognize and detect multiclass (or multiview) objects with large variation in appearance. Compared with well-acknowledged hierarchical models, we study two advanced capabilities in hierarchy for object modeling: (i) “switch” variables (i.e., or-nodes) for specifying alternative compositions, and (ii) making local classifiers (i.e., leaf-nodes) shared among different classes. These capabilities enable us to account well for structural variabilities while keeping the model compact. Our model, in the form of an And-Or Graph, comprises four layers: a batch of leaf-nodes with collaborative edges at the bottom for localizing object parts; the or-nodes above them to activate their children leaf-nodes; the and-nodes to classify objects as a whole; and one root-node at the top for switching multiclass classification, which is also an or-node. For model training, we present an EM-type algorithm, namely dynamical structural optimization (DSO), to iteratively determine the structural configuration (e.g., leaf-node generation associated with their parent or-nodes and shared across other classes), along with optimizing multi-layer parameters. The proposed method is validated on challenging databases, e.g., PASCAL VOC 2007 and UIUC-People, and it achieves state-of-the-art performance.

1. Introduction 

Object recognition is an area of active research in computer vision, and its performance has improved substantially in recent years [6, 19, 9, 12, 4, 16]. The objective of this work is to develop a novel hierarchical and reconfigurable model for multiclass object recognition, in the form of an And-Or graph representation, as Fig. 1 illustrates. We study the following two issues, which are often ignored or oversimplified in previous works.
Figure 1. An example of the proposed 4-layer And-Or graph model for multiclass object recognition. Parts of the model for sheep and horse are shown. The squares at the bottom represent the leaf-nodes, which can be shared among different classes (e.g., the leaf-node for localizing legs is shared between sheep and horse). The or-nodes above them are used to activate their children leaf-nodes, tackling the appearance variability.

Model reconfigurability. One key challenge in object modeling is to capture the large object variation in appearance and view/pose. Some recently proposed deformable part-based models [6, 19] handle this challenge by using hierarchical and contextual compositions, and achieve remarkable progress. However, the structural configurations of these models are mainly fixed, e.g., the number of part detectors and the ways of composition. Inspired by the And-Or graph models in [13, 27, 7, 23], we develop “switch variables”, namely or-nodes, to specify alternative compositions in the hierarchy. In detection, each or-node is used to activate its children leaf-nodes (i.e., local classifiers), accounting for intraclass variance. It is worth mentioning that the association of or-nodes with their children leaf-nodes can be determined automatically during model training. In Fig. 1, the sheep head is localized by the leaf-node that is activated by its parent or-node.

Model sharing. In the context of multiclass object recognition, existing systems commonly treat different classes as unrelated entities. According to acknowledged studies [20, 18, 16], sharing information among different classes can boost model performance in general and alleviate the requirement for a large amount of training data. Recently, Salakhutdinov et al. [17] proposed a learning-to-share framework that allows rare objects to borrow statistical strength from other related classes, and demonstrated impressive results. This inspires us to share structure within the And-Or graph model, adapting it to the task of multiclass recognition. In our method, the leaf-nodes are sharable among different classes, so that the model remains compact while representing multiple object categories. For example, in Fig. 1, the feet in the categories horse and sheep have similar appearances, and thus can both be detected by the leaf-node shared across the two classes.

The key contribution of this work is a novel And-Or graph model for multiclass object recognition that addresses both of the above issues. Without loss of generality, we define our four-layer model as Fig. 1 illustrates. The leaf-nodes (denoted by squares) at the bottom are discriminative classifiers for detecting object parts. Each or-node (denoted by dashed circles) in the third layer activates one of its children leaf-nodes in detection, and the or-nodes are allowed to perturb slightly to capture deformations. The and-nodes (denoted by solid circles) in the second layer are global classifiers for object classes. The root-node at the top is for switching multiclass recognition, and is also an or-node. In addition, we define collaborative edges (denoted by curved connections) to encode intraclass (part-level) relations; interclass contexts are modeled in a similar way, with edges that connect the and-nodes.

One non-trivial problem in model training is to automatically determine the model structure without requiring elaborate supervision and initialization. We propose a novel algorithm for this problem, namely Dynamical Structural Optimization (DSO), motivated by the recently proposed structural optimization methods [24, 11]. It is an EM-type iteration with three steps. (i) Estimate the model latent variables for optimization, according to the parameters from the previous iteration. (ii) Reconfigure the model structure by clustering. In this step, we produce leaf-nodes associated with their parent or-nodes and make leaf-nodes shared across classes. (iii) Check whether to accept the newly generated model structure, and update the model parameters.

Due to the large variance among classes, it would be intractable to train the classes altogether by pooling all samples from different classes into one bag. In this work, we first partition all classes into several groups by a data-driven approach, in order to reduce the computational complexity of model sharing. Then we train the models for the object classes in each group. For example, we can easily decide to put sheep and horses into one group and train the multiclass model with sharing. Afterwards, the trained models for all groups are combined into the complete model by re-weighting the parameters of all the models. The collaborative edges are also learned during this step.

2. Related Work 

Traditional multiclass object detectors are trained in a one-vs-all manner, where each object category is trained independently. These methods often rely on a large amount of training data. A pioneering work [20] learns shared features among classes and improves the classifier in both effectiveness and efficiency. Opelt et al. [14] further incorporate incremental learning with classifier sharing. To discover hierarchical structures of object categories, the Hierarchical Latent Dirichlet Allocation (hLDA) model is presented in [1 ]. The efficiency can be significantly improved by integrating taxonomies with the object hierarchy [8].

To tackle realistic challenges in object recognition, many deformable part-based methods have recently been developed using latent structural learning [6, 26, 19]. These models have also been extended to multiclass recognition and detection [4, 17, 16, 15]. For example, Razavi et al. [16] present the multiclass Hough Forest combined with part-based models; Desai et al. [4] further incorporate context information into the hierarchy, and predict a structured labeling for each image during detection. However, the commonly used part-based models are often defined as a tree structure whose configuration is fixed during learning and detection, and they may have problems handling objects with large appearance and structure variations.

And-Or graph models were first proposed for modeling complex visual patterns by Zhu and Mumford [25]. Their general idea, using And/Or nodes to account for structural compositions and variabilities in a hierarchy, has been applied in several vision tasks, e.g., human parsing [23, 27] and object modeling [13]. These approaches often require supervised learning or manual initialization. Fidler et al. [7] propose to train the And-Or graph for multiclass shape-based detection in a generative way, and extensively discuss the learning strategies. Motivated by these works, we propose an alternating way to discriminatively train the And-Or graph model for multiclass object recognition, and achieve superior performance.

3. And-Or Graph Model 

Our multiclass object model is constructed in the form of an And-Or graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ contains three types of nodes and $\mathcal{E}$ represents the collaborative edges. The root-node is indexed as 0, indicating the switch among classes. The and-nodes are indexed by $r = 1, \ldots, m$, each representing one category. For each and-node, there are 9 or-nodes arranged in a layout of $3 \times 3$ blocks to represent object parts, and we index all the or-nodes as $j = m+1, \ldots, 10m$. The leaf-nodes in the fourth layer are indexed by $i = 10m+1, \ldots, 10m+n$, where $n$ is the number of leaf-nodes, which is dynamically adjusted during training. For notational simplicity, we define $m' = 10m + 1$ and $n' = 10m + n$, and $i \in ch(j)$ indexes a child node of node $j$. The details of our model are presented as follows.

Sharable Leaf-node: The leaf-nodes $L_i$, $i = m', \ldots, n'$, are local classifiers for object parts, and they can be shared among different classes. Specifically, if a leaf-node is affiliated to the $j$-th or-node, it can also be shared by the or-nodes in other classes indexed by $j + 9 \times k$, where $k \in \{1, 2, \ldots\}$. We denote the location of leaf-node $L_i$ as $P_i$, which is determined by the parent or-node that activates $L_i$ during inference. The response of $L_i$ is defined as

$$R^l_i(X, P_i) = \omega^l_i \cdot \phi^l(X, P_i). \tag{1}$$

In our implementation, a HOG [3] pyramid is built across different image scales as in [6]. $\phi^l(X, P_i)$ is the HOG feature extracted from image $X$ at position $P_i$, and $\omega^l_i$ is a parameter vector.
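To make the indexing scheme concrete, the following Python sketch enumerates the index ranges of each layer and the or-nodes that may share a leaf-node for a given part. The function names are hypothetical, and the sketch assumes that the 9 or-nodes of each class occupy one contiguous block, which is what the $j + 9 \times k$ sharing rule suggests.

```python
def node_index_ranges(m, n):
    """Inclusive index ranges for each layer, given m classes and n leaf-nodes."""
    return {
        "root": (0, 0),
        "and": (1, m),
        "or": (m + 1, 10 * m),
        "leaf": (10 * m + 1, 10 * m + n),  # (m', n') in the text
    }


def or_nodes_for_part(m, part):
    """Or-node indices of the same part (0..8) in all m classes: j, j+9, j+18, ...

    Assumption: the or-nodes of one class are stored as a contiguous block of 9.
    """
    if not 0 <= part < 9:
        raise ValueError("part must be in 0..8")
    first = m + 1 + part
    return [first + 9 * k for k in range(m)]


ranges = node_index_ranges(m=2, n=5)
shared_candidates = or_nodes_for_part(m=2, part=0)
```

With two classes, the or-nodes occupy indices 3 to 20, and the first part of both classes corresponds to or-nodes 3 and 12, which are the candidates for sharing one leaf-node.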

Or-node: The or-nodes $U_j$, $j = m+1, \ldots, 10m$, in the third layer are “switch” variables that select (activate) their children. For each leaf-node $L_i$, we define a variable $V_i \in \{0, 1\}$ to represent its activation during inference. An indicator vector is then composed for each or-node $U_j$: $V_j = (V_{i_1}, V_{i_2}, \ldots)$, where $i_k \in ch(j)$ and $\|V_j\| \in \{0, 1\}$. Note that $\|V_j\| = 1$ only when one of the leaf-nodes is activated under $U_j$. The response of $U_j$ is thus defined as

$$R^u_j(X, P_j, V_j) = \sum_{i \in ch(j)} R^l_i(X, P_j) \cdot V_i, \tag{2}$$

where $P_j$ denotes the position of $U_j$, which is allowed to perturb slightly during inference. We define a feature for object deformation as $\phi^s(P_r, P_j) = (dx, dy, dx^2, dy^2)$, where $(dx, dy)$ represents the displacement of $U_j$ relative to its anchor position, which is determined by the position $P_r$ of its parent. The response of the deformation is defined as

$$R^s_j(P_r, P_j) = \omega^s_j \cdot \phi^s(P_r, P_j). \tag{3}$$
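As an illustration of Eqs. (2) and (3), the sketch below computes an or-node response from precomputed leaf responses and the deformation response from a displacement. It is a minimal sketch: the function names are hypothetical, positions are plain (x, y) tuples, and the leaf responses are assumed to be already available as numbers.

```python
def deformation_feature(anchor, pos):
    """phi^s = (dx, dy, dx^2, dy^2) for the displacement of pos from anchor."""
    dx = pos[0] - anchor[0]
    dy = pos[1] - anchor[1]
    return (dx, dy, dx * dx, dy * dy)


def deformation_response(w_s, anchor, pos):
    """Eq. (3): inner product of the deformation weights and phi^s."""
    return sum(w * f for w, f in zip(w_s, deformation_feature(anchor, pos)))


def or_node_response(leaf_responses, activation):
    """Eq. (2): child leaf responses weighted by binary activations.

    At most one child may be active, matching the constraint on ||V_j||.
    """
    if sum(activation) > 1 or any(v not in (0, 1) for v in activation):
        raise ValueError("activation must be binary with at most one active child")
    return sum(r * v for r, v in zip(leaf_responses, activation))


phi = deformation_feature((10, 10), (12, 9))
cost = deformation_response((0, 0, 0.5, 0.5), (10, 10), (12, 9))
resp = or_node_response([0.3, 1.2], [0, 1])
```

In the local testing step of inference, the deformation response is subtracted from the or-node response, so weights on the quadratic terms act as a penalty on large displacements.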


Figure 2. Illustration of the features for defining collaborative edges. (a) shows the feature $\psi^l(P_i, P_{i'})$ between leaf-nodes; (b) shows the feature $\psi^a(P_r, P_{r'})$ between and-nodes.


And-node: The and-nodes $A_r$, $r = 1, \ldots, m$, are global classifiers for objects. Suppose $A_r$ is placed at $P_r$ during detection; we extract the HOG feature $\phi^a(X, P_r)$ for the and-node at half the resolution of the feature extracted for the leaf-nodes. We define the response of $A_r$ with its parameters $\omega^a_r$ as

$$R^a_r(X, P_r) = \omega^a_r \cdot \phi^a(X, P_r). \tag{4}$$

Root-node: The root-node at the top is an or-node for switching between classes, i.e., choosing among its children and-nodes. As for the or-nodes, we define for each and-node $A_r$ an activation $V_r \in \{0, 1\}$, and the indicator vector for the root-node is $V_0 = (V_1, \ldots, V_m)$ with $\|V_0\| = 1$, i.e., only one child is selected.

Collaborative edge: There are two types of collaborative edges in our model, representing the spatial co-occurrence between different leaf-nodes as well as between different and-nodes. For the collaborative edges between leaf-nodes, we introduce a 4-bin binary feature $\psi^l(P_i, P_{i'})$. Each bin of $\psi^l(P_i, P_{i'})$ represents one of the relations clockwise, anti-clockwise, near and far between two leaf-nodes $L_i$ and $L_{i'}$. As Fig. 2(a) illustrates, the bold rectangle in the middle represents the location of $L_i$. If the center of $L_{i'}$ is localized in the dotted rectangle, it is near $L_i$; otherwise it is far from $L_i$. We connect the initial centers of $L_i$ and $L_{i'}$ with the dashed line, and the red line represents their layout after accounting for deformation. We then use two bins to indicate whether the angle between the dashed line and the red line is clockwise or anti-clockwise. We thus define the response of the collaborative edge between two leaf-nodes as

$$\Upsilon^l_{i,i'}(P_i, P_{i'}) = \alpha^l_{i,i'} \cdot \psi^l(P_i, P_{i'}), \tag{5}$$

where $\alpha^l_{i,i'}$ is a 4-bin parameter vector. Motivated by [4], we define a 6-bin binary feature $\psi^a(P_r, P_{r'})$ representing the contextual relations above, below, beside, overlap, near and far between two objects. As Fig. 2(b) illustrates, the bold rectangle in the middle represents the window of $A_r$, and the dashed and dotted rectangles represent the bins that are set to 1 if the center of $A_{r'}$ falls inside. The response of the collaborative edge between two and-nodes is defined as

$$\Upsilon^a_{r,r'}(P_r, P_{r'}) = \alpha^a_{r,r'} \cdot \psi^a(P_r, P_{r'}), \tag{6}$$

where $\alpha^a_{r,r'}$ is a 6-bin parameter vector. In practice, we only connect two leaf-nodes whose parent or-nodes are spatially adjacent, and the and-nodes are connected across classes.

4. Inference 

Given an image, the task of inference is to localize all the multiclass objects with the model. For simplicity, we denote the vector of selections for and-nodes together with leaf-nodes as $V = (V_1, \ldots, V_m, V_{m'}, \ldots, V_{n'})$, and the vector of placements as $P = (P_1, \ldots, P_{10m})$.

A subgraph of the And-Or graph, rooted at one of the and-nodes, can be regarded as a detector for one class. For each subgraph, we compute its scores by sliding the detection sub-window over different positions and scales of the image. This procedure integrates local testing and binding testing as follows.

Local testing: For a subgraph model rooted at $A_r$ (i.e., $V_r = 1$) and placed at $P_r$ in the image, we assume a hypothesis $V$ for the leaf-node selections. The placement of each part can then be obtained by combining Eq. (2) and Eq. (3):

$$P_j = \arg\max_{P_j} \big( R^u_j(X, P_j, V_j) - R^s_j(P_r, P_j) \big) = \arg\max_{P_j} \Big( \sum_{i \in ch(j)} R^l_i(X, P_j) \cdot V_i - R^s_j(P_r, P_j) \Big), \tag{7}$$

where $R^l_i(X, P_j)$ represents the leaf-node response; these responses can be shared among different classes by computing them once at the beginning of inference. The score of local testing is then calculated as

$$S^l_r(X, P, V) = \sum_{j \in ch(r)} \big( R^u_j(X, P_j, V_j) - R^s_j(P_r, P_j) \big). \tag{8}$$
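The local testing step can be sketched in Python as an exhaustive search over candidate placements. This is a simplified illustration, not the full inference: it picks the best child leaf-node per or-node independently, whereas the complete model scores each selection hypothesis jointly with the binding terms. The function names and the dictionary-based input format are assumptions made for the example.

```python
def best_placement(leaf_scores, anchor, w_s):
    """Search placements and child leaves for one or-node.

    leaf_scores maps a candidate position (x, y) to the list of responses of the
    or-node's child leaf-nodes at that position. Returns (score, position, leaf).
    """
    best = None
    for pos, responses in leaf_scores.items():
        dx, dy = pos[0] - anchor[0], pos[1] - anchor[1]
        # Deformation response: weights applied to (dx, dy, dx^2, dy^2).
        cost = w_s[0] * dx + w_s[1] * dy + w_s[2] * dx * dx + w_s[3] * dy * dy
        for leaf, response in enumerate(responses):
            score = response - cost
            if best is None or score > best[0]:
                best = (score, pos, leaf)
    return best


def local_score(parts):
    """Sum the best per-part scores; parts is a list of (leaf_scores, anchor, w_s)."""
    return sum(best_placement(*part)[0] for part in parts)


scores = {(0, 0): [1.0, 0.5], (1, 0): [0.2, 2.5]}
result = best_placement(scores, anchor=(0, 0), w_s=(0, 0, 1, 1))
```

In this toy input the second leaf at position (1, 0) wins even after paying a deformation cost of 1, because its response exceeds the best response at the anchor by more than that cost.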

Binding testing: We obtain the response of the and-node $R^a_r(X, P_r)$ with Eq. (4). For each hypothesis $V$, we compute the scores of the intra-class contextual relations between the selected leaf-nodes via Eq. (5). The binding score is then calculated as

$$S^a_r(X, P, V) = R^a_r(X, P_r) + \sum_{i,i'=m'}^{n'} \Upsilon^l_{i,i'}(P_i, P_{i'}) \cdot V_i \cdot V_{i'}, \tag{9}$$

where the leaf-node location is set as $P_i = P_j$ for $i \in ch(j)$ with $V_i = 1$. Integrating these two procedures, we maximize over the hypotheses $V$ to obtain the detection score of the subgraph rooted at $A_r$:

$$S^g(X, r, P_r) = \max_{V} \big( S^l_r(X, P, V) + S^a_r(X, P, V) \big). \tag{10}$$

After the detections for all subgraphs, we can represent the image as a collection of $K$ scored sub-windows, overlapping at different scales. Our objective is to label them with $Y = \{y_1, \ldots, y_K\}$, where $y_k \in \{1, \ldots, m\}$ represents object classes and $y_k = -1$ the background. The multiclass detection score of the image can be defined by combining Eq. (10) and Eq. (6),

$$S(X) = \max_{Y} \Big( \sum_{k=1}^{K} S^g(X, y_k, P_k) + \sum_{k,k'=1}^{K} \Upsilon^a_{y_k, y_{k'}}(P_k, P_{k'}) \Big), \tag{11}$$

where $P_k$ indicates the position of the $k$-th sub-window of the image. The optimization of Eq. (11) can be solved by the greedy forward search described in [4]. We define an instance set $INS = \{(k, y)\}$ indicating that $y_k = y$ for all $(k, y) \in INS$ and $y_k = 0$ otherwise. The greedy method is then performed as in Algorithm 1.
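The greedy forward search of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the unary scores are given as a dictionary keyed by (window, class), the context response is passed in as a hypothetical callable `pair_score`, each window receives at most one label (as in the greedy procedure of [4]), and the loop stops as soon as the best remaining gain is not positive.

```python
def greedy_forward(scores, pair_score):
    """Greedily instantiate (window, class) pairs.

    scores: dict mapping (k, y) to the unary detection score of window k as class y.
    pair_score(k1, y1, k2, y2): context response between two labelled windows.
    Returns the chosen instances and the accumulated detection score.
    """
    delta = dict(scores)
    instances, total = [], 0.0
    used_windows = set()
    while True:
        candidates = {ky: d for ky, d in delta.items() if ky[0] not in used_windows}
        if not candidates:
            break
        (k, y), gain = max(candidates.items(), key=lambda item: item[1])
        if gain <= 0:  # adding any further instance would not increase the score
            break
        instances.append((k, y))
        used_windows.add(k)
        total += gain
        # Update the remaining gains with the context of the new instance.
        for (k2, y2) in delta:
            if k2 not in used_windows:
                delta[(k2, y2)] += pair_score(k, y, k2, y2) + pair_score(k2, y2, k, y)
    return instances, total


def toy_pair_score(k1, y1, k2, y2):
    # Hypothetical context: two objects of class 1 support each other.
    return 0.4 if y1 == 1 and y2 == 1 else 0.0


toy_scores = {(0, 1): 2.0, (0, 2): 1.0, (1, 1): -0.5, (1, 2): 0.5, (2, 1): -1.0}
chosen, score = greedy_forward(toy_scores, toy_pair_score)
```

Because the pairwise terms are only added after an instance is selected, each iteration costs one pass over the remaining candidates, which is what makes the search practical for many overlapping sub-windows.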

Algorithm 1 Greedy Forward Inference

Input: The detection scores $S^g(X, r, P_r)$ for image sub-windows.

Output: Instance set $INS$, detection score $S$.

Initialization: $INS = \{\}$, $S = 0$, $\delta(k, y) = S^g(X, y, P_k)$ for $k \in \{1, \ldots, K\}$.

repeat
  1. $(k^*, y^*) = \arg\max_{(k, y)} \delta(k, y)$ for all $(k, y) \notin INS$.
  2. $INS = INS \cup \{(k^*, y^*)\}$.
  3. $S = S + \delta(k^*, y^*)$.
  4. $\delta(k, y) = \delta(k, y) + \Upsilon^a_{y^*, y}(P_{k^*}, P_k) + \Upsilon^a_{y, y^*}(P_k, P_{k^*})$.
until $\delta(k^*, y^*) < 0$ and $S$ stops increasing.

5. Dynamical And-Or Graph Learning

The training of our And-Or graph is a two-stage procedure: (i) estimating the model structure (without edges) and parameters for each object group; (ii) combining the models and learning the collaborative edges.

To reduce the computational cost of model sharing, we first divide the object classes into several groups as a data-driven initialization, and train the multiclass model for each group. Afterwards, we combine the trained models to construct the final And-Or graph model.

The learning in stage (i) is an EM-type procedure incorporating structure reconfiguration and parameter estimation. During each iteration, our algorithm dynamically creates and removes leaf-nodes associated with their parent or-nodes, and shares leaf-nodes among classes. More precisely, a leaf-node is created to better handle the intra-class variance (Fig. 3(b)); a leaf-node is removed if there is another similar one (Fig. 3(d)); and a leaf-node is shared if it can capture similar appearances in other classes (Fig. 3(c)).

5.1. Data Driven Initialization 

Suppose the number of all classes is $M$. We partition them into several groups as a data-driven initialization for training. The partition is based on the similarity between pairs of classes, which we calculate as follows.

(I) We first learn a two-layer deformable part-based model [6] $T^k = \{T^k_i\}$ for each class, where $T^k_i$ represents one part classifier for the $k$-th class. We then apply $T^k$ to perform detection on the positive training samples of every class. During detection, each $T^k_i$ extracts a set of image patches from different samples, and we group these patches into a cluster $\Omega^k_i$. Note that the size of the image patches detected by $T^k_i$ is $(h^k_i, w^k_i)$. We further merge all $\Omega^k_i$ into a few new sets, each of which contains image patches of similar size $(h^k_i, w^k_i)$. In each of the new sets, we describe the image patches with the HOG descriptor and group them into several clusters using the ISODATA algorithm with Euclidean distance.

(II) Afterwards, a matrix $\mathcal{M}$ is defined to represent the similarity between the $M$ classes. In each set of image patches, if there are patches from classes $j$ and $k$ falling into the same cluster, we set $\mathcal{M}(j, k) \leftarrow \mathcal{M}(j, k) + 1$. Two classes $j$ and $k$ are assumed to share their models if $\mathcal{M}(j, k) > \sigma$, where $\sigma$ is a threshold set to $M/3$ empirically.

(III) Based on the calculated $\mathcal{M}$, we assign the classes that possibly share models to the same group $\mathbb{S}$. We thus obtain a few groups $\{\mathbb{S}_1, \ldots, \mathbb{S}_c\}$. Each $\mathbb{S}$ contains $|\mathbb{S}|$ classes, and we discuss the training method for each $\mathbb{S}$ in the following section.

Figure 3. Dynamical Structural Optimization. Parts of the multiclass model for sheep and horse are illustrated in different iterations. (a) The model structure after the first iteration; (b) a new leaf-node is created to recognize the head of sheep; (c) a leaf-node for sheep leg is shared with the horse; (d) a leaf-node for horse leg is removed.
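The grouping in steps (II) and (III) can be sketched in Python as below. The text does not specify how pairwise decisions are turned into groups, so this sketch assumes connected components of the graph that links classes whose similarity exceeds the threshold; it also assumes that each cluster increments a class pair at most once. The function names and the input format (one list of class ids per patch cluster) are hypothetical.

```python
from itertools import combinations


def similarity_matrix(cluster_members, num_classes):
    """Count, for each class pair, the clusters that contain patches of both.

    cluster_members: list of clusters, each a list of the class ids of its patches.
    """
    sim = [[0] * num_classes for _ in range(num_classes)]
    for members in cluster_members:
        for j, k in combinations(sorted(set(members)), 2):
            sim[j][k] += 1
            sim[k][j] += 1
    return sim


def group_classes(sim, threshold):
    """Group classes via connected components of the 'sim > threshold' graph."""
    n = len(sim)
    parent = list(range(n))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for j in range(n):
        for k in range(j + 1, n):
            if sim[j][k] > threshold:
                parent[find(j)] = find(k)
    groups = {}
    for c in range(n):
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())


clusters = [[0, 1, 1], [0, 1], [2], [1, 0, 2], [3, 3]]
sim = similarity_matrix(clusters, num_classes=4)
groups = group_classes(sim, threshold=4 / 3)  # threshold M/3 with M = 4
```

In this toy input, classes 0 and 1 co-occur in three clusters and end up in one group, while classes 2 and 3 each remain alone.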


5.2. Optimization Formulation 

Given an object group $\mathbb{S}$, we train a multiclass model without collaborative edges, in a procedure integrating structure reconfiguration and parameter estimation. Suppose there is a set of $N$ training samples $(X_1, y_1), \ldots, (X_N, y_N)$ in $\mathbb{S}$, where $X$ is the image, $y \in \{1, \ldots, |\mathbb{S}|\}$ labels the object classes, and $y = -1$ labels the background. At the beginning of training, we initialize the multiclass model with $m = |\mathbb{S}|$ and-nodes and one leaf-node for each or-node. The detection score of this model can be represented as the maximization of Eq. (10) over the $m$ and-nodes, with the edge parameters set to zero,

$$S'(X) = \max_{r, P_r} S^g(X, r, P_r) = \max_{(P, V)} \Big( \sum_{i=m'}^{n'} \omega^l_i \cdot \phi^l(X, P_i) \cdot V_i - \sum_{r=1}^{m} \sum_{j \in ch(r)} \omega^s_j \cdot \phi^s(P_r, P_j) \cdot \|V_j\| + \sum_{r=1}^{m} \omega^a_r \cdot \phi^a(X, P_r) \cdot V_r \Big), \tag{12}$$

where the first two terms represent the response of local testing, and the last term is the and-node response. For simplicity, we refer to $H = (P, V)$ as the latent variables, and we redefine Eq. (12) in a discriminative form as

$$S_\omega(X) = \arg\max_{(y, H)} \big( \omega \cdot \phi(X, y, H) \big), \tag{13}$$

where $\omega$ includes the complete parameters of the current model, and $\phi(X, y, H)$ is defined as

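$$\phi(X, y, H) = \begin{cases} \phi(X, H), & y \in \{1, \ldots, |\mathbb{S}|\}, \\ 0, & y = -1, \end{cases} \tag{14}$$

and $\phi(X, H)$ is the overall feature vector.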

The function (13) can be learned by applying a structural SVM with latent variables,

$$\min_{\omega} \frac{1}{2}\|\omega\|^2 + C \sum_{k=1}^{N} \Big[ \max_{y, H} \big( \omega \cdot \phi(X_k, y, H) + \mathcal{L}(y_k, y) \big) - \max_{H} \big( \omega \cdot \phi(X_k, y_k, H) \big) \Big], \tag{15}$$

where $C$ is a penalty parameter set to 0.005 empirically, and we define the loss function as $\mathcal{L}(y_k, y) = 0$ when $y_k = y$, and $\mathcal{L}(y_k, y) = 1$ if $y_k \neq y$. In recent works [26, 10], the CCCP [24] method is applied to solve this non-convex optimization; it provides an iterative approach that reaches a local minimum. However, in these methods the model structure is assumed to be fixed, e.g., without or-nodes. Motivated by these works, we propose a Dynamical Structural Optimization (DSO) method to train our model.

5.3. Dynamical Structural Optimization 

To optimize the objective in Eq. (15), we transform it into a convex and concave form following [26],

$$\min_{\omega} \Big[ \frac{1}{2}\|\omega\|^2 + C \sum_{k=1}^{N} \max_{y, H} \big( \omega \cdot \phi(X_k, y, H) + \mathcal{L}(y_k, y) \big) \Big] - \Big[ C \sum_{k=1}^{N} \max_{H} \big( \omega \cdot \phi(X_k, y_k, H) \big) \Big] \tag{16}$$

$$= \min_{\omega} \big[ f(\omega) - g(\omega) \big], \tag{17}$$

where $f(\omega)$ represents the first two terms in (16) and $g(\omega)$ the remaining term. We then present our 3-step Dynamical Structural Optimization method as follows.

(I) Suppose we are in iteration $t$, and $\omega_t$ is the parameter vector updated in the previous iteration. We first find a hyperplane $q_t$ to upper bound $-g(\omega)$ in (17),

$$-g(\omega) \le -g(\omega_t) + (\omega - \omega_t) \cdot q_t, \quad \forall \omega. \tag{18}$$

We calculate $q_t$ by finding the optimal latent variables $H^*_k = \arg\max_{H} \big( \omega_t \cdot \phi(X_k, y_k, H) \big)$. That is, we apply the current model to perform detection on the training samples, and the hyperplane is constructed as $q_t = -C \sum_{k=1}^{N} \phi(X_k, y_k, H^*_k)$.
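Step (I) can be illustrated with a small Python sketch. It assumes, purely for illustration, that each training sample comes with a finite list of candidate feature vectors, one per latent assignment $H$; in the actual model the maximization is carried out by running detection. The function names are hypothetical.

```python
def best_latent(w, candidates):
    """Return the candidate feature vector with the highest score under w."""
    return max(candidates, key=lambda phi: sum(a * b for a, b in zip(w, phi)))


def linearization(w, samples, C):
    """Compute q_t = -C * sum_k phi(X_k, y_k, H_k*) for the current weights w.

    samples: list over training samples; each entry is a list of candidate
    feature vectors phi(X_k, y_k, H), one per latent assignment H.
    """
    q = [0.0] * len(w)
    for candidates in samples:
        phi = best_latent(w, candidates)
        for d, value in enumerate(phi):
            q[d] -= C * value
    return q


w_t = [1.0, 0.0]
samples = [
    [[1.0, 0.0], [0.0, 5.0]],  # sample 1: two latent assignments
    [[2.0, 1.0], [3.0, 0.0]],  # sample 2: two latent assignments
]
q_t = linearization(w_t, samples, C=0.5)
```

With the latent variables fixed in this way, the concave part of the objective becomes linear in $\omega$, which is what makes the subsequent parameter update a convex problem.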

(II) We adjust the model by structural reconfiguration and sharing, performed independently on each of the 9 object parts over the $m$ classes. Given the latent variables $H^*_k$ for a sample, we can obtain the activated leaf-nodes and the image patches detected via them. For each leaf-node $L_i$, we group the patches detected via it from all samples into a cluster $\Omega_i$, and the size of these patches is $(h_i, w_i)$.

We index the nine object parts by $j$ ($m+1 \le j \le m+9$). For the $j$-th part, we pool together the clusters whose corresponding leaf-nodes are associated with the or-nodes $U_{j+9k}$ ($0 \le k < m$) from the $m$ classes. These clusters are then merged into a few new sets, each of which contains patches of similar size $(h_i, w_i)$. For each new set, we describe the image patches with the HOG descriptor and cluster them by applying ISODATA with Euclidean distance.

After the clustering, the leaf-nodes are reconfigured as follows. If a cluster is newly generated, we create a new leaf-node accordingly; we remove a leaf-node if there are few image patches in the corresponding cluster. For a cluster $\Omega_i$, if it contains image patches localized by $U_j$ (in step (I)), we associate the leaf-node $L_i$ with $U_j$. Thus $L_i$ is shared by different classes whenever it is associated with or-nodes of different classes.

The feature vector $\phi$ of each sample is also adjusted according to the clustering result. Recall that the HOG vector of an image patch is part of $\phi$, and the patches in the same cluster are represented with the same bins in $\phi$. We present a toy example in Fig. 4 for illustration. The sub-vector $(\phi_5, \ldots, \phi_8)$ of sample $X_3$ is regrouped from one cluster to another; the feature bins are then moved from $(\phi_5, \ldots, \phi_8)$ to $(\phi_1, \ldots, \phi_4)$, as (a) and (c) show. We define the new feature vector of each sample after clustering as $\phi^d(X_k, y_k, H^*_k)$; the hyperplane of step (I) is then reconstructed as $q^d_t = -C \sum_{k=1}^{N} \phi^d(X_k, y_k, H^*_k)$.

(III) With the current model structure and $q^d_t$, we can learn the model parameters by solving

$$\omega^d_t = \arg\min_{\omega} \big( f(\omega) + \omega \cdot q^d_t \big). \tag{19}$$

By substituting $f(\omega)$ with the first two terms defined in Eq. (16), we can rewrite Eq. (19) as

$$\min_{\omega} \frac{1}{2}\|\omega\|^2 + C \sum_{k=1}^{N} \Big[ \max_{y, H} \big( \omega \cdot \phi(X_k, y, H) + \mathcal{L}(y_k, y) \big) - \omega \cdot \phi^d(X_k, y_k, H^*_k) \Big]. \tag{20}$$

The optimization of Eq. (20) can be solved by a standard structural SVM. After that, we calculate the energy of the objective as $E(\omega^d_t) = f(\omega^d_t) - g(\omega^d_t)$.

If $E(\omega^d_t) < E(\omega_t)$, we accept the new model structure and set $\omega_{t+1} = \omega^d_t$. Otherwise, we keep the model configuration of the previous iteration and continue the parameter optimization without structure reconfiguration, as in Eq. (19): $\omega_{t+1} = \arg\min_{\omega} \big( f(\omega) + \omega \cdot q_t \big)$.

In this way, we ensure that the optimization objective in Eq. (17) keeps decreasing over the iterations. The algorithm iterates until the objective converges.
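The accept-or-reject logic of step (III) can be sketched in Python as below. The sketch abstracts the expensive parts into three hypothetical callables supplied by the caller: `propose` performs the structural reconfiguration, `refit` solves the convex parameter update for a given structure, and `energy` evaluates the objective. It only illustrates the control flow, not the actual structural SVM training.

```python
def dso_step(current, propose, refit, energy):
    """One accept/reject step of the structure search.

    current: tuple (structure, weights) from the previous iteration.
    propose(structure, weights): candidate structure after reconfiguration.
    refit(structure, weights): parameters re-estimated for that structure.
    energy(structure, weights): value of the training objective.
    """
    structure, weights = current
    candidate_structure = propose(structure, weights)
    candidate_weights = refit(candidate_structure, weights)
    if energy(candidate_structure, candidate_weights) < energy(structure, weights):
        return candidate_structure, candidate_weights  # accept the new structure
    # Reject: keep the old structure but still update the parameters.
    return structure, refit(structure, weights)


# Toy stand-ins: the structure is a leaf count, the weights a single number.
def toy_propose(structure, weights):
    return structure + 1


def toy_refit(structure, weights):
    return weights * 0.5


def toy_energy(structure, weights):
    return (structure - 3) ** 2 + weights


accepted = dso_step((1, 4.0), toy_propose, toy_refit, toy_energy)
rejected = dso_step((3, 1.0), toy_propose, toy_refit, toy_energy)
```

In the toy run, moving from one leaf to two lowers the energy and is accepted, whereas moving away from the toy optimum of three leaves is rejected and only the weights are updated.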

5.4. Model Combination 

After training the multiclass models for each object group in $\{\mathbb{S}_1, \ldots, \mathbb{S}_c\}$, we combine them into a complete model for all the object categories. Intuitively, the root-nodes of the groups are first merged into the final top root-node, so that all the original and-nodes are associated with the new root-node. We then introduce an $n'$-dimensional vector $\beta = (\beta_1, \ldots, \beta_{n'})$ to re-weight the parameters of the newly generated model. Meanwhile, the collaborative edges defined in Eq. (5) and Eq. (6) are trained as well.

Figure 4. A toy example of feature adjustment according to structural clustering. Parts of 4 feature vectors associated with two different leaf-nodes are presented. (a) shows the feature vectors generated after Step (I), whose values are indicated by the intensities of the bins; (b) shows the structural re-clustering: the features $(\phi_5, \ldots, \phi_8)$ of $X_3$ are moved from Cluster 2 to Cluster 1; (c) updates the feature vectors according to the clustering results.

For simplicity, we shorten the responses of the leaf-node, the or-node deformation and the and-node as $R^l_i(k) = R^l_i(X, P^k_i)$, $R^s_j(k) = R^s_j(P^k, P^k_j)$ and $R^a_{y_k} = R^a_{y_k}(X, P^k)$. Given an image $X$, the objective function $S(X)$ of multiclass recognition defined in Eq. (11) is reformulated as

$$\max_{Y} \sum_{k=1}^{K} \Big[ \sum_{i=m'}^{n'} \beta_i \cdot R^l_i(k) \cdot V^k_i - \sum_{j \in ch(y_k)} \beta_j \cdot R^s_j(k) + \beta_{y_k} \cdot R^a_{y_k} + \sum_{i,i'=m'}^{n'} \alpha^l_{i,i'} \cdot \psi^l(P^k_i, P^k_{i'}) \cdot V^k_i V^k_{i'} + \sum_{k'=1}^{K} \alpha^a_{y_k, y_{k'}} \cdot \psi^a(P^k, P^{k'}) \Big],$$

where the first two terms represent the local testing score, the next two represent the binding testing score, and the last one accounts for the edge responses between and-nodes.

For the training, we collect a set of images containing multiclass objects, each of which is labeled with $Y = \{y_1, \ldots, y_K\}$. Given each image, we first obtain the latent variables $H_k$ with Eq. (13) by fixing $y_k$, and the responses for each part are derived at the same time. We can then use the responses $R^l_i(k)$, $R^s_j(k)$ and $R^a_{y_k}$ as part of the input feature, and train the parameters $\beta$, $\alpha^l$ and $\alpha^a$ by a standard structural SVM. Here the loss function for training is defined as $\mathcal{L}'(Y, Y') = \mathcal{K} - tp$, where $\mathcal{K}$ indicates the number of objects in the groundtruth $Y$, and $tp$ is the number of true positives in $Y'$ according to $Y$.

Afterwards, the parameters of the leaf-nodes ($L_i$, $i = m', \ldots, n'$), the or-node deformations ($U_j$, $j = m+1, \ldots, 10m$) and the and-nodes ($A_r$, $r = 1, \ldots, m$) are re-weighted as $\omega^l_i = \beta_i \cdot \omega^l_i$, $\omega^s_j = \beta_j \cdot \omega^s_j$ and $\omega^a_r = \beta_r \cdot \omega^a_r$.
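The final re-weighting amounts to scaling each node's parameter vector by its learned scalar. A minimal Python sketch, assuming the parameters and the re-weighting coefficients are stored in dictionaries keyed by node index (a hypothetical layout chosen for the example):

```python
def reweight(params, beta):
    """Scale each node's parameter vector by its re-weighting coefficient.

    params: dict mapping node index to its parameter vector (list of floats).
    beta: dict mapping node index to the learned scalar for that node.
    """
    return {idx: [beta[idx] * w for w in vec] for idx, vec in params.items()}


combined = reweight({1: [1.0, 2.0], 5: [0.5]}, {1: 2.0, 5: 0.0})
```

A coefficient of zero removes a node's contribution entirely, so the re-weighting can also suppress parts that turn out to be uninformative once all groups are scored together.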
























Method     Ours (full)   Ours (sim)   [23]    [1]     [6]     [2]
Accuracy   0.845         0.818        0.668   0.506   0.486   0.458

Table 1. Detection accuracies on UIUC people dataset.


6. Experiments 

We evaluate our method on two challenging datasets: UIUC people [21] and PASCAL VOC 2007 [t ].

Dataset and Setting. The UIUC people dataset contains 593 images (296 for training, 297 for testing), and most of them contain one person playing badminton. The PASCAL VOC 2007 dataset contains 9963 images of 20 object categories, with 5011 images for training and 4952 images for testing. In both datasets, we represent each object category with two views, i.e., each object category is specified by two and-nodes in our model. Hence, we perform 2-class recognition on the UIUC people dataset, and 40-class recognition on the PASCAL VOC 2007 dataset. During evaluation, we adopt the PASCAL Challenge criterion: a detection is considered correct only if its intersection over union with the groundtruth bounding box is at least 50%. All our experiments are carried out on a PC with a Core Duo 3.0 GHz CPU and 16 GB memory. We denote our fully implemented model as “Ours(full)”, since we will simplify the model in different settings for empirical study.
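The overlap criterion can be written as a short Python function. This is a sketch assuming boxes are given as continuous (x1, y1, x2, y2) coordinates with positive width and height; the official evaluation code works on pixel coordinates and may differ in boundary handling.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_correct(detection, groundtruth, threshold=0.5):
    """A detection counts as correct if its overlap reaches the threshold."""
    return iou(detection, groundtruth) >= threshold


half_overlap = iou((0, 0, 10, 10), (0, 0, 10, 5))
```

For example, a detection that covers only the upper half of a groundtruth box has an overlap of exactly 0.5 and is accepted, while two equally sized boxes shifted by half their width overlap by only one third and are rejected.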

6.1. Experimental Results 

UIUC people dataset. Model training takes 11 iterations and around 6 hours to converge, and detection on one image takes about 5 seconds. We compare our model with the state-of-the-art human detectors [23, 1, 2, 6], some of which use manually labeled models. The detection accuracy is calculated as in [23]: only the detection with the highest score in the image is considered. As Table 1 reports, our approach reaches a detection accuracy of 84.5%, outperforming the other methods. Moreover, we demonstrate the advantage of our model in handling object variations during detection in Fig. 6. We visualize the detectors generated by our trained model in the form of HOG patterns. The detectors for the and-nodes and leaf-nodes are shown in Fig. 6(a). Note that some of the leaf-nodes are shared to capture similar appearances. Two detectors, each composed of 9 activated leaf-nodes, are visualized in Fig. 6(b). The two detectors are generated when recognizing the images beside them. The results show that our model can generate alterable detectors that adapt to diverse object appearances and poses.

PASCAL VOC 2007 dataset. Training the 40-class model on this database takes 25 to 30 iterations and 30 to 34 hours. On average, it takes 92 seconds to detect all 20 classes of objects in one input image. We then calculate the average precision (AP) to evaluate our method. As shown in Table 2, our method achieves a mean AP (mAP) of 34.7%, which is highly competitive with the state-of-the-art methods: 29.0% [9], 29.2% [16], 29.6% [26], 32.1% [22] and 26.8% [6]. We also note that a recent method achieves a significantly higher mAP of 37.7% [19] by employing multi-kernel classification in detection.

Figure 5. Extensive experiments for discussion. “Our-S” indicates a simplified model without sharing leaf-nodes. (a) shows the APs on the UIUC people dataset; (b) shows the number of leaf-nodes as the number of object categories increases on the PASCAL VOC 2007 dataset.

6.2. Evaluation for Model Sharing 

To analyze the effectiveness of sharing leaf-nodes, we disable model sharing during training to obtain a simplified non-sharing And-Or graph model, named “Ours(sim)”. As Table 1 reports, “Ours(sim)” achieves a detection accuracy of 81.8%, 2.7 percentage points lower than the fully implemented model. We also compare the APs of these two models in Fig. 5(a), in which the APs are plotted against the number of training iterations. Each AP for a specific iteration number is obtained by testing the model trained for that number of iterations. The APs of “Ours(full)” and “Ours(sim)” reach 72.8% and 68.3%, respectively, after 11 iterations.

We expect that the model complexity, represented by the number of leaf-nodes, can be effectively reduced by model sharing. Thus, we also present an experiment showing the number of leaf-nodes in model training as the number of object categories increases, in Fig. 5(b). Specifically, we obtain 552 leaf-nodes for the 20 object categories of the PASCAL VOC 2007 dataset, fewer than the 717 leaf-nodes of the “Ours(sim)” model.

7. Conclusion 

This paper introduces a novel method for multiclass object detection and recognition, in the form of an And-Or graph. Our model is shown to handle well the challenges of recognizing objects with large variance. Moreover, we illustrate the benefits of information sharing among classes, which leads to a more compact and better model. Since our learning method (DSO) is very general, it can be extended to many other vision tasks.















Figure 6. Visualization of the trained model on the UIUC people dataset. (a) shows parts of the model with two classes (views), in which we visualize the detectors (in the form of HOG patterns) for the and-nodes and leaf-nodes based on the learned parameters, along with the example images recognized by the model. (b) visualizes two detectors that are each composed of 9 activated leaf-nodes. The two detectors are generated, respectively, when recognizing the images beside them. The detectors for the and-nodes are not visualized here.



Class     Ours (full)   MC [9]   HF [16]   LEO [26]   MKL [22]   UoCTTI [6]
plane     32.5          33.4     26.0      29.4       37.6       29.0
bicycle   60.1          37.0     56.0      55.8       47.8       54.6
bird      11.1          15.0     10.0      9.4        15.3       0.6
boat      16.0          15.0     11.0      14.3       15.3       13.4
bottle    31.0          22.6     21.0      28.6       21.9       26.2
bus       50.9          43.1     47.0      44.0       50.7       39.4
car       59.0          49.3     50.0      51.3       50.6       46.4
cat       26.1          32.8     16.0      21.3       30.0       16.1
chair     21.2          11.5     19.0      20.0       17.3       16.3
cow       26.5          35.8     23.0      19.3       33.0       16.5
table     25.4          17.8     20.0      25.2       22.5       24.5
dog       16.4          16.3     12.0      12.5       21.5       5.0
horse     61.7          43.6     51.0      50.4       51.2       43.6
mbike     48.3          38.2     45.0      38.4       45.5       37.8
person    42.2          29.8     37.0      36.6       23.3       35.0
plant     16.1          11.6     12.0      15.1       12.4       8.8
sheep     28.2          33.3     17.0      19.7       23.9       17.3
sofa      30.1          23.5     29.0      25.1       28.5       21.6
train     44.6          30.2     41.0      36.8       45.3       34.0
tv        46.3          39.6     38.0      39.3       48.5       39.0
Avg.      34.7          29.0     29.2      29.6       32.1       26.8

Table 2. Results on PASCAL VOC 2007 (AP in %).